Can Corpus Pattern Analysis Be Used in NLP?
Abstract
Corpus Pattern Analysis (CPA) [1], coined and implemented by Hanks as the Pattern Dictionary of English Verbs (PDEV) [2], appears to be the only deliberate and consistent implementation of Sinclair's concept of the Lexical Item [3]. In his theoretical inquiries [4], Hanks hypothesizes that the pattern repository produced by CPA can also support the word sense disambiguation task. Although more than 670 verb entries have already been compiled in PDEV, no systematic evaluation of this ambitious project has been reported yet. Assuming that the Sinclairian concept of the Lexical Item is correct, we started to closely examine PDEV with its possible NLP application in mind. The experiments presented in this paper were performed on a pilot sample of English verbs to provide a first reliable view on whether humans can agree in assigning PDEV patterns to verbs in a corpus. In conclusion, we suggest procedures for the future development of PDEV.

1 Corpus Pattern Analysis

1.1 What Is a Lexical Item?

John Sinclair, the Nestor of corpus linguistics, criticized the separation of grammar and lexicon: the grammar (in extreme cases) only describes the form of a lexical item with respect to its potential context, while the lexicon primarily describes the meaning comprised by its base form, regardless of context. Not only are form and meaning tightly related, Sinclair argues [3, p. 59f.], they must even be identical, considering that most ambiguities are resolved by context in authentic language usage. Hence, a description of lexical items should take both aspects into account at the same time. Instead of describing the paradigmatic properties of each lexical item by listing the potential senses of its lemma, he pleads for describing both the syntagmatic and the paradigmatic properties of each lexical item as patterns in which the given lexical item occurs [3, p. 69].

1.2 Pattern Dictionary of English Verbs (PDEV)

Hanks, Sinclair's collaborator on the first corpus-based dictionary ever, the Collins Cobuild English Language Dictionary [5], has proposed Corpus Pattern Analysis (CPA), a semi-formal lexical description method that consistently materializes Sinclair's concept of capturing meanings in patterns of language, rather than in lexical units as in the token-centered lexicographic tradition. The current CPA captures "normal", i.e. reasonably frequent, usages of a given verb by sorting them into patterns. Each pattern is formulated as a proposition in which the verb in question is lemmatized (the only exception being passivization) and its relevant collocates are classified by means of two sets of semantic labels or listed as lexical sets, depending on whether the respective collocates can be enumerated (as a lexical set) or grouped together under the general heading of a Semantic Type. Each proposition is paraphrased by a sentence in which the relevant pattern arguments are labeled identically with the proposition part. This paraphrase embodies the implicature (or meaning potential, see [1]) activated by that particular pattern. Each collocate that cannot be represented by a lexical set is described by a Semantic Type, sometimes augmented by a Semantic Role. The Semantic Types are a finite set of labels hierarchically ordered in what Hanks calls a shallow semantic ontology [2]. They describe inherent properties of the collocates, such as Human, Artifact, Stuff, or Document, whereas the Semantic Roles describe properties that are assigned to a word only in a particular pattern or context.
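To make the anatomy of a pattern concrete, the following sketch models the components described above as a small data structure. This is an illustration only: the class and field names, as well as the sample entry (including the Semantic Type labels used in it), are our own hypothetical choices, not PDEV's actual XML encoding.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Argument:
    """One collocate slot of a pattern's proposition (hypothetical model)."""
    slot: str                            # e.g. "subject", "object"
    semantic_type: Optional[str] = None  # label from the shallow ontology, e.g. "Human"
    semantic_role: Optional[str] = None  # property assigned only in this pattern/context
    lexical_set: List[str] = field(default_factory=list)  # when collocates are enumerable

@dataclass
class CPAPattern:
    """A CPA pattern: a proposition plus the implicature it activates."""
    verb: str          # lemmatized (except for passivization)
    proposition: str   # the pattern formulated as a proposition
    implicature: str   # paraphrase spelling out the meaning potential
    arguments: List[Argument]

# A made-up entry illustrating the structure (not an actual PDEV pattern):
pattern = CPAPattern(
    verb="abandon",
    proposition="[[Human]] abandon [[Activity]]",
    implicature="[[Human]] stops doing [[Activity]] before it is complete",
    arguments=[
        Argument(slot="subject", semantic_type="Human"),
        Argument(slot="object", semantic_type="Activity"),
    ],
)
```

The essential design point is that the meaning (the implicature) attaches to the pattern as a whole, not to the verb lemma alone.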
CPA is implemented as PDEV, the Pattern Dictionary of English Verbs, built by Hanks and his collaborators [2]. It comprises two interlinked components: a list of patterns for each verb and a reference set of manually tagged sample data. Each verb in PDEV is linked to a reference sample of concordances containing the verb in question. The sample is randomly selected from the British National Corpus (BNC) [6], and its size is typically 250–500 concordances, depending on the semantic complexity of the verb. We perceive Hanks' patterns as a means of discriminating Sinclairian lexical items, which, in their own right, imply what is usually referred to as "meaning". To the best of our knowledge, PDEV is the first real and conscious implementation of Sinclair's principles concerning the lexical item and the way it should be described. This fact makes PDEV unique, although there are certainly a number of other projects that formally describe semantic distinctions of verb uses, with different theoretical foundations, e.g. [7,8].

1.3 PDEV as a Source for NLP?

Hanks' approach to the lexical description of verbs is novel and linguistically sound at the same time. It has gained a world-wide reputation, judging by the more than 600 topic-related citations for Hanks (according to Harzing's Publish or Perish, counted since Hanks 1994, recorded as 1993; retrieved 2010-03-24), as well as by the numerous keynote speeches Hanks has been invited to give on this subject since the first significant mention of CPA in [1]. CPA is intuitively plausible, and its formal encoding appears promising for various applications in NLP, the more so because Hanks has been continuously linking his lexicon to other well-known lexical sources, such as FrameNet [7] or the Erlangen Valency Bank [9]. However, the "qualified judgment" on the hypothesized NLP usability of CPA pronounced by a number of language experts has not yet been experimentally tested. With our experiments we take a first step towards a reliable assessment of whether or not the current PDEV is suitable for NLP applications. In this short paper we report on an ongoing pilot study in which we examine the consistency of PDEV, which we regard as the basic prerequisite for its NLP usability. Should we identify problematic issues, we suggest (and plan to implement) improvements based on a pilot sample in the next step.

2 Current Status of PDEV Development

2.1 Platform of PDEV Development

The development of PDEV is supported by two interconnected applications. The first, used for pattern editing, is based on the "Dictionary Editor and Browser" (DEB) tool, a dictionary-making database platform developed at Masaryk University in Brno (MU), Czech Republic [10]. This platform enables the lexicographic processing of XML-encoded data through a user-friendly web-based graphical user interface integrated as an add-on in the Mozilla Firefox web browser. The data is stored on DEB servers located at MU. PDEV is one of the numerous applications of DEB; it incorporates a tailored interface for pattern creation as well as for ontology browsing and editing. The second application, used for concordance tagging, is a modified version of the Sketch Engine [11].

2.2 Current PDEV Statistics

PDEV has been developed on the basis of verb occurrences in the BNC50 corpus, a 50-million-word part of the BNC. BNC50 contains almost 5,800 verb types occurring in 8 million verb tokens. However, about 41% of all verb tokens represent auxiliary ('will', 'do', 'have', 'be') or modal ('shall', 'can', 'must', etc.) verbs, which are not analysed in the PDEV project at all. The number of lexical verb types in BNC50 is 5,757, and the total number of the corresponding tokens is 4,673,003. Table 1 illustrates the well-known fact that rare words do not significantly contribute to corpus coverage: verbs with a frequency higher than 27 cover the corpus up to 99.5%. Currently (March 2010) the number of verbs compiled in PDEV is 678, i.e. 11.8% of all lexical verb types in BNC50. The number of corresponding tokens in BNC50 is 495,724, which covers 10.6% of all BNC50 lexical verb tokens.

Table 1. The coverage of BNC50 verb tokens. For example, the 918 most frequent verbs, each of which occurs at least 610 times in BNC50, cover more than 90% of all BNC50 lexical verb tokens.

min. frequency   54,872   8,723   610    246     136     90      48      28      1
verb types       7        120     918    1,519   2,030   2,452   3,151   3,780   5,757
BNC50 coverage   11%      50%     90%    95%     97%     98%     99%     99.5%   100%
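Since Table 1 is essentially a cumulative token count over a frequency list, the computation behind it is easy to reproduce. The sketch below shows one way to derive such coverage figures; the function name and the toy frequency list are our own, and the toy values are illustrative rather than actual BNC50 counts.

```python
from typing import Dict, List, Tuple

def coverage_by_threshold(freq: Dict[str, int],
                          thresholds: List[int]) -> List[Tuple[int, int, float]]:
    """For each minimum frequency, count the verb types reaching it and
    the share of all verb tokens that those types cover."""
    total_tokens = sum(freq.values())
    rows = []
    for t in thresholds:
        kept = [f for f in freq.values() if f >= t]
        rows.append((t, len(kept), sum(kept) / total_tokens))
    return rows

# Toy frequency list (illustrative values, not the real BNC50 counts):
toy = {"take": 54872, "abandon": 610, "grovel": 28, "xerox": 1}
for threshold, types, cov in coverage_by_threshold(toy, [610, 28, 1]):
    print(f"min. frequency {threshold}: {types} verb types, {cov:.1%} coverage")
```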
The total number of patterns created for the compiled verbs is 2,572. While the average number of patterns per compiled verb type is 3.79, a more telling value is the expected number of patterns per token, i.e. the frequency-weighted average of the per-verb pattern counts, which is 9.72 (more frequent verbs often also have more patterns). The correlation between verb frequency and the number of patterns is shown in Fig. 1.

[Fig. 1. The number of patterns of 502 PDEV verbs (with frequency at least 28) plotted against their frequency in BNC50.]

3 First Evaluation of PDEV

3.1 Evaluation Method

PDEV, with its tagged reference samples, can be regarded as a manually created gold-standard data set for machine-learning experiments. So far, the lexicon has mainly been built by Hanks; in terms of annotation, all the data available has been annotated by one single annotator. Moreover, the author of the patterns and the data annotator are the same person. Our first question was therefore: are humans who did not create the entries themselves able to agree in pattern assignment? A reasonable degree of inter-annotator agreement is a prerequisite for any further automatic processing. This question has two aspects, which we want to keep apart: creating the lexicon and annotating the data. Here we focus only on the consistency of tagging the data according to already existing patterns. We regard the mutual agreement of independently working annotators as a measure of the quality of each given lexical entry. As with any linguistically rich annotation, the annotators must be clearly instructed and trained before inter-annotator agreement can be measured. The authors of this paper, who acted as annotators, had only learned the details of the annotation procedure on the fly, over more than a year of discussing the patterns and their own data findings with Hanks, watching him work, and gaining occasional hands-on experience with creating a new entry. No detailed annotation guidelines were available at that point, and we expected this to lower our inter-annotator agreement. While tagging, we each kept a record of difficult decisions for future reference when a regular annotation guide for new annotators is created, and we analyzed these records along with the annotated data once all samples were finished. Hanks performed the same annotation alongside us, and his sample annotation served as a reference in case of doubt.
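The section above does not commit to a particular agreement coefficient, so as a minimal sketch we show two standard choices for two annotators tagging the same concordance sample: raw observed agreement and Cohen's kappa, which corrects for chance agreement. The pattern labels and annotation vectors below are hypothetical.

```python
from collections import Counter
from typing import Sequence

def observed_agreement(a: Sequence[str], b: Sequence[str]) -> float:
    """Share of concordance lines to which both annotators assigned the same pattern."""
    assert len(a) == len(b)
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohens_kappa(a: Sequence[str], b: Sequence[str]) -> float:
    """Chance-corrected agreement between two annotators over the same items."""
    n = len(a)
    p_o = observed_agreement(a, b)
    ca, cb = Counter(a), Counter(b)
    # Expected agreement if both annotators assigned labels independently
    # at their observed rates.
    p_e = sum(ca[label] * cb[label] for label in set(ca) | set(cb)) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical pattern assignments for ten concordance lines of one verb:
ann1 = ["1", "1", "2", "1", "3", "2", "1", "1", "2", "1"]
ann2 = ["1", "2", "2", "1", "3", "2", "1", "1", "1", "1"]
print(f"observed agreement: {observed_agreement(ann1, ann2):.2f}")  # 0.80
print(f"Cohen's kappa:      {cohens_kappa(ann1, ann2):.2f}")        # 0.63
```

Computed per verb entry, either measure could serve as the entry-quality score that the evaluation method above calls for.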